Automatic debiased machine learning (autoDML) for estimating ATE: bringing autoDML into applied practice

Tong Chen, Stijn Vansteelandt, Margarita Moreno-Betancur

MCRI & UniMelb, UGent, MCRI & UniMelb

Motivation

  • ATE is a central causal estimand in clinical and epidemiological research

  • A substantial portion of biostatistics is devoted to analysing observational data.

    • Treatment is not randomised (confounding is unavoidable)
    • Correct model specification is difficult in complex data settings
  • Existing singly robust causal inference methods face practical challenges: they are sensitive to model misspecification, or unstable under near-positivity violations

Machine learning?

  • ML is increasingly used

    • with traditional parametric models, misspecification is likely and hard to diagnose
    • pre-specifying flexible ML models can make the analysis more objective
  • But our estimands are causal effects e.g. ATE

  • Naive “plug-in ML” for the ATE can be:

    • biased (e.g. oversmoothing, or regularisation throwing out important confounders)
    • without valid inference (standard errors and confidence intervals are not justified)

Causal ML

  • The goal is to estimate the causal parameter using ML
    • and obtain valid confidence intervals,
    • while allowing flexible ML methods to estimate some nuisance functions.
  • This is particularly attractive when:
    • we have high-dimensional confounders, and/or
    • we want non-parametric and flexible models.

History of causal ML

  • Semiparametric efficiency, AIPW, efficient influence functions
    • Foundations in 1980s–1990s: Pfanzagl (1982), Newey (1990), Robins, Rotnitzky & Zhao (1994), van der Vaart (1991)
  • Targeted Learning / TMLE
    • van der Laan and co-authors used this theory to build plug-in estimators that incorporate machine learning (TMLE)
      (van der Laan & Rubin, 2006/2008; van der Laan & Rose, 2011/2014)

History of causal ML

  • Double / Debiased Machine Learning (DML)
    • Chernozhukov, Newey, Robins, and co-authors popularised orthogonal scores
      • sample splitting under weaker conditions
        (Robins et al., 2008; Chernozhukov et al., 2018)
  • autoDML
    • Builds on this line of work: Chernozhukov, Newey, and co-authors automate the construction of Riesz representers for a large class of parameters.

Notation - for ATE

  • Observed data: for each observation \(W\),

    • Outcome: \(Y\)
    • Binary treatment: \(D \in \{0,1\}\)
    • Covariates: \(Z\)
  • Let \(X = (D, Z)\) and \(W = (Y, X) = (Y, D, Z)\)

  • Define the outcome regression function \(\gamma\) by \[ \gamma(d, z) = \mathbb{E}[Y \mid D = d, Z = z], \] and let \(\gamma_0\) denote the true regression function.

Define ATE based on linear moment functionals

  • Assume the parameter of interest \(\theta\) is defined through a linear moment functional \[ \theta = E[m(W,\gamma)] = E[\gamma(1, Z) - \gamma(0, Z)], \] where \(\gamma(1, Z) = E(Y \mid D=1, Z)\)

Debiasing

  • The debiased version of the moment is of the form:

\[ \theta = E\bigl[ m(W; \gamma) \;+\; \alpha(X)\,\bigl(Y - \gamma(X)\bigr) \bigr]. \]

  • We want to find the function \(\alpha\) that removes the first-order plug-in bias (Robins et al., 1994; Chernozhukov et al., 2018)

Riesz representer \(\alpha(X)\)

  • The function \(\alpha\) is the Riesz representer of the linear functional \(E[m(W; \gamma)]\).

  • We focus on problems where there exists a square-integrable random variable \(\alpha(X)\) such that: \[ E\bigl[m(W; \gamma)\bigr] \;=\; E\bigl[\alpha(X)\,\gamma(X)\bigr], \quad \text{for all } \gamma \text{ with } E\bigl[\gamma(X)^2\bigr] < \infty. \]

  • By the Riesz representation theorem, such an \(\alpha(X)\) exists if and only if \(E[m(W; \gamma)]\) is a continuous linear functional of \(\gamma\).
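
  • With the representer in hand, one can check that the correction term removes the first-order plug-in bias. A standard sketch (treating \(\hat{\gamma}\) and \(\hat{\alpha}\) as fixed, as cross-fitting justifies): since \(m\) is linear in \(\gamma\) and \(E[Y \mid X] = \gamma_0(X)\),

\[ \begin{aligned} E\bigl[m(W;\hat{\gamma})\bigr] - \theta_0 &= E\bigl[m\bigl(W;\hat{\gamma} - \gamma_0\bigr)\bigr] = E\bigl[\alpha_0(X)\,\{\hat{\gamma}(X) - \gamma_0(X)\}\bigr], \\[6pt] E\bigl[\hat{\alpha}(X)\,\{Y - \hat{\gamma}(X)\}\bigr] &= E\bigl[\hat{\alpha}(X)\,\{\gamma_0(X) - \hat{\gamma}(X)\}\bigr], \end{aligned} \]

    so the bias of the debiased moment is \(E\bigl[\{\alpha_0(X) - \hat{\alpha}(X)\}\{\hat{\gamma}(X) - \gamma_0(X)\}\bigr]\): a product of the two nuisance errors, hence second order when both are estimated reasonably well.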

Asymptotic theory

  • Let \(\gamma_0\) and \(\alpha_0\) be the true nuisance functions, and define the score \[ \phi(W) \;=\; m\bigl(W;\gamma_0\bigr) \;+\; \alpha_0(X)\,\bigl(Y - \gamma_0(X)\bigr) \;-\; \theta_0. \]
  • Under standard regularity conditions:
      1. Neyman orthogonality of the score,
      2. sample splitting (cross-fitting) for \(\hat{\gamma}\) and \(\hat{\alpha}\),
      3. rate conditions on the nuisance estimators (e.g. the product of their convergence rates is \(o(n^{-1/2})\)),
    we have \[ \sqrt{n}\,\bigl(\hat{\theta} - \theta_0\bigr) \;\xrightarrow{d}\; N\!\left(0,\,\sigma^2\right), \qquad \sigma^2 = E\bigl[\phi(W)^2\bigr]. \]

Estimate Riesz representer

  • Traditional approach: derive the explicit form of the Riesz representer, then plug in estimated nuisances

  • For the ATE, the Riesz representer is

\[ \alpha_0(D, Z) \;=\; \frac{D}{\pi_0(Z)} - \frac{1 - D}{1 - \pi_0(Z)}, \]

    where \(\pi_0(Z) = P(D = 1 \mid Z)\) is the true propensity score

  • Can we instead estimate the Riesz representer directly?

  • i.e. estimate \(\widehat{\frac{D}{\pi_0(Z)}}\) rather than \(\frac{D}{\widehat{\pi_0(Z)}}\)
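
The defining property \(E[m(W;\gamma)] = E[\alpha(X)\,\gamma(X)]\) can be checked by Monte Carlo for the ATE representer. A small sketch with an illustrative one-covariate DGP of our own (any square-integrable \(\gamma\) should work):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative data (an assumed DGP, not from the talk)
Z = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-Z))        # true propensity score pi_0(Z)
D = rng.binomial(1, pi)

# An arbitrary square-integrable test function gamma(d, z)
def gamma(d, z):
    return d * np.sin(z) + z**2

# Riesz representer for the ATE functional
alpha = D / pi - (1 - D) / (1 - pi)

lhs = np.mean(gamma(1, Z) - gamma(0, Z))   # E[m(W; gamma)]
rhs = np.mean(alpha * gamma(D, Z))         # E[alpha(X) gamma(X)]
print(lhs, rhs)  # the two Monte Carlo averages agree up to sampling error
```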

Riesz Regression & autoDML

  • the Riesz representer is the minimiser of a loss function that involves no unknown quantities: \[ \begin{aligned} \alpha_{0} &= \operatorname{argmin}_{\alpha} \,E\Bigl[\bigl(\alpha(X) - \alpha_{0}(X)\bigr)^2\Bigr] \\[6pt] &= \operatorname{argmin}_{\alpha} \,E\Bigl[ \alpha(X)^2 \;-\; 2\,\alpha_{0}(X)\,\alpha(X) \Bigr] \\[6pt] &= \operatorname{argmin}_{\alpha} \,E\Bigl[ \alpha(X)^2 \;-\; 2\,m\bigl(W; \alpha\bigr)\Bigr] \\[6pt] &= \operatorname{argmin}_{\alpha} \,E\Bigl[ \alpha(D, Z)^2 - 2\bigl(\alpha(1, Z) - \alpha(0, Z)\bigr) \Bigr], \end{aligned} \] where the second line drops the constant \(E[\alpha_0(X)^2]\), and the third uses the Riesz representation \(E[\alpha_0(X)\,\alpha(X)] = E[m(W;\alpha)]\).

  • Estimating \(\alpha\) by minimising this loss (Riesz regression) and plugging it into the debiased moment is autoDML
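
For a linear-in-parameters class \(\alpha(x) = b(x)^\top \rho\), the empirical Riesz loss has a closed-form minimiser: \(\hat{\rho} = \hat{G}^{-1}\hat{M}\) with \(\hat{G} = \frac{1}{n}\sum_i b(X_i)b(X_i)^\top\) and \(\hat{M} = \frac{1}{n}\sum_i \{b(1,Z_i) - b(0,Z_i)\}\). A minimal numpy sketch (the DGP and the dictionary are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Illustrative data: two covariates, logistic propensity (assumed)
Z = rng.normal(size=(n, 2))
pi = 1.0 / (1.0 + np.exp(-(Z[:, 0] - 0.5 * Z[:, 1])))
D = rng.binomial(1, pi)

def basis(d, Z):
    """Dictionary b(d, z): separate linear terms for each treatment arm."""
    p = np.column_stack([np.ones(len(Z)), Z])
    d = np.broadcast_to(np.asarray(d, float).reshape(-1, 1), (len(Z), 1))
    return np.hstack([d * p, (1 - d) * p])

B = basis(D, Z)                                      # b(X_i)
M_hat = (basis(1, Z) - basis(0, Z)).mean(axis=0)     # mean of m(W_i; b)
G_hat = B.T @ B / n
rho = np.linalg.solve(G_hat, M_hat)                  # closed-form minimiser
alpha_hat = B @ rho                                  # learned Riesz weights
```

Within the span of the dictionary, \(\hat{\alpha}\) approximates \(D/\pi_0(Z) - (1-D)/(1-\pi_0(Z))\), yet no propensity score is ever estimated or inverted.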

Implementation

  • Partition the set of data indices \(1,\ldots,n\) into \(L\) disjoint subsets of about equal size \(\{I_\ell\}_{\ell=1}^L\).

  • For each data fold \(\ell = 1,\ldots,L\):

    • Estimate \(\hat{\gamma}_\ell \in \mathcal{G}_n\) as a nonparametric regression of \(Y\) on \(X\), using observations not in \(I_\ell\).

    • Estimate the debiasing function \(\hat{\alpha}_\ell\) using observations not in \(I_\ell\), by minimising a sample version of the objective function (with \(\Lambda_r\) an optional regularisation penalty) \[ \hat{\alpha}_\ell = \arg\min_{\alpha \in \mathcal{A}_n} \Biggl\{ \sum_{i \notin I_\ell} \Bigl[ -\,2\,m(W_i;\alpha) + \alpha(X_i)^2 \Bigr] + \Lambda_r(\alpha) \Biggr\}. \]

Implementation

  • Estimate the parameter of interest by averaging the debiased moment over the held-out folds (cross-fitting): \[ \hat{\theta} = \frac{1}{n}\sum_{\ell=1}^L \sum_{i \in I_\ell} \Bigl[ m\bigl(W_i;\hat{\gamma}_\ell\bigr) + \hat{\alpha}_\ell\bigl(X_i\bigr)\bigl\{Y_i - \hat{\gamma}_\ell\bigl(X_i\bigr)\bigr\} \Bigr]. \]

  • Estimate the standard error of \(\hat{\theta}\) as \(\sqrt{\hat{V}/n}\), where \[ \hat{V} = \frac{1}{n}\sum_{\ell=1}^L \sum_{i \in I_\ell} \Bigl[ m\bigl(W_i;\hat{\gamma}_\ell\bigr) + \hat{\alpha}_\ell\bigl(X_i\bigr)\bigl\{Y_i - \hat{\gamma}_\ell\bigl(X_i\bigr)\bigr\} - \hat{\theta} \Bigr]^2. \]
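
The procedure can be sketched end to end. This is a minimal, dependency-free illustration: the nuisance learners are deliberately simple (OLS for \(\gamma\), a linear dictionary for \(\alpha\)) rather than the flexible ML learners the talk has in mind, and the data follow the simulation slide (true ATE = 2).

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 2_000, 2

# DGP from the simulation slide (true ATE = 2)
Z = rng.normal(size=(n, 3))
pi = 1.0 / (1.0 + np.exp(-(2 * Z[:, 0] - Z[:, 1] + 0.5 * Z[:, 2])))
D = rng.binomial(1, pi)
Y = 2 * D + Z[:, 0] - Z[:, 1] + 0.5 * Z[:, 2] + rng.normal(size=n)

def design(d, Z):
    """Outcome-regression features (1, d, z)."""
    d = np.broadcast_to(np.asarray(d, float), (len(Z),))
    return np.column_stack([np.ones(len(Z)), d, Z])

def rr_basis(d, Z):
    """Linear Riesz dictionary with separate terms per treatment arm."""
    p = np.column_stack([np.ones(len(Z)), Z])
    d = np.broadcast_to(np.asarray(d, float).reshape(-1, 1), (len(Z), 1))
    return np.hstack([d * p, (1 - d) * p])

folds = np.array_split(rng.permutation(n), L)
psi = np.empty(n)                        # per-observation debiased scores
for idx in folds:
    tr = np.setdiff1d(np.arange(n), idx)
    # gamma_hat: OLS of Y on (1, D, Z) using the other folds
    beta = np.linalg.lstsq(design(D[tr], Z[tr]), Y[tr], rcond=None)[0]
    # alpha_hat: Riesz regression, closed form for a linear class
    B = rr_basis(D[tr], Z[tr])
    M = (rr_basis(1, Z[tr]) - rr_basis(0, Z[tr])).mean(axis=0)
    rho = np.linalg.solve(B.T @ B / len(tr), M)
    # Debiased moment evaluated on the held-out fold
    m = (design(1, Z[idx]) - design(0, Z[idx])) @ beta
    resid = Y[idx] - design(D[idx], Z[idx]) @ beta
    psi[idx] = m + (rr_basis(D[idx], Z[idx]) @ rho) * resid

theta, se = psi.mean(), psi.std() / np.sqrt(n)
print(f"ATE = {theta:.3f} (SE {se:.3f})")
```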

Implementation using neural network

  • Since \(D \in \{0,1\}\), we can decompose the Riesz representer as
    \[ \alpha(D,Z) \;=\; D \,\alpha(1,Z) \;+\; (1-D)\,\alpha(0,Z). \]

  • This suggests using a two-headed MLP:

    • a shared feature network for \(Z\)
    • two output heads: one for \(\alpha(1,Z)\) and one for \(\alpha(0,Z)\)
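
A minimal forward-pass sketch of this architecture in numpy (layer size and random weights are placeholder assumptions; in practice the network is trained by minimising the Riesz loss):

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder weights; in practice these are learned from the Riesz loss.
W1 = 0.5 * rng.normal(size=(3, 16))   # shared feature layer for Z (3 covariates)
b1 = np.zeros(16)
w1_head = 0.5 * rng.normal(size=16)   # head for alpha(1, Z)
w0_head = 0.5 * rng.normal(size=16)   # head for alpha(0, Z)

def features(Z):
    return np.maximum(Z @ W1 + b1, 0.0)   # shared ReLU layer

def alpha1(Z):
    return features(Z) @ w1_head

def alpha0(Z):
    return features(Z) @ w0_head

def alpha(D, Z):
    # Decomposition from the slide: alpha(D,Z) = D*alpha(1,Z) + (1-D)*alpha(0,Z)
    return D * alpha1(Z) + (1 - D) * alpha0(Z)

Z = rng.normal(size=(5, 3))
D = np.array([1, 0, 1, 0, 1])
print(alpha(D, Z))
```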

Simulation

  • We generate 1,000 datasets, each with 2,000 observations.

  • Covariates: \(Z_1, Z_2, Z_3 \sim N(0,1)\).

  • True propensity score: \(\pi(Z) = \operatorname{expit}\!\left( 2Z_1 - Z_2 + 0.5 Z_3 \right)\).

  • Treatment: \(D \sim \mathrm{Bernoulli}\bigl(\pi(Z)\bigr)\).

  • Outcome model with known ATE of 2: \(Y = 2 D + Z_1 - Z_2 + 0.5 Z_3 + \varepsilon,\qquad\varepsilon\sim N(0,1)\).
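
This data-generating process takes a few lines to reproduce; the snippet also shows why confounding adjustment is needed here (a sketch; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000

Z = rng.normal(size=(n, 3))                         # Z1, Z2, Z3 ~ N(0, 1)
pi = 1.0 / (1.0 + np.exp(-(2 * Z[:, 0] - Z[:, 1] + 0.5 * Z[:, 2])))  # expit
D = rng.binomial(1, pi)                             # treatment
Y = 2 * D + Z[:, 0] - Z[:, 1] + 0.5 * Z[:, 2] + rng.normal(size=n)

# Naive difference in means is biased upward by confounding (true ATE = 2)
naive = Y[D == 1].mean() - Y[D == 0].mean()
print(naive)
```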

Estimation methods

  • We compare autoDML, DML (AIPW), and TMLE.

  • For DML and TMLE:

    • Both the outcome regression and the propensity score are estimated with correctly specified parametric models.
  • For autoDML:

    • The outcome regression is correctly specified.

    • The Riesz representer is estimated using a two-headed MLP, with four shared layers and two separate output heads for \(\alpha(1, Z)\) and \(\alpha(0, Z)\).

Results

Method       Mean ATE   SD      MSE     Coverage
TMLE         1.997      0.088   0.008   90%
DML (AIPW)   1.998      0.142   0.020   97%
autoDML      1.997      0.068   0.005   93%

Summary & Future work

  • autoDML automatically learns the Riesz representer. This greatly simplifies implementation for complex causal parameters.

  • More stable than the standard AIPW estimator

  • Simple simulations show that learned Riesz weights reduce variance relative to AIPW and TMLE with plug-in nuisance models.

  • Future work will evaluate autoDML in more realistic simulation studies that are directly motivated by cohort studies.

Thank you!